Introduction
The task is to provide detailed insights into the different marketing campaigns in a market.
a). Give an overview of the entire market’s development and the different campaigns, and summarize the most important findings.
b). Assess the development of the quality of traffic. How is the overall development, and how does each campaign evolve?
c). Advise a business developer who wants to spend an additional 250€ per week from week 31 onwards on a marketing campaign.
• Advise the business developer on which campaign to invest in, and why.
• How does this impact the overall performance in the market from week 31 onwards?
Marketing campaign dataset
# Import the dataset
df <- read.csv("marketing_campaigns.csv",header=TRUE,stringsAsFactors = F, sep=";")
summary(df)
## Week Campaign Visits Revenue
## Min. : 1.00 Length:91 Min. : 27.0 Min. : 2.269
## 1st Qu.: 8.00 Class :character 1st Qu.:144.0 1st Qu.:122.557
## Median :16.00 Mode :character Median :158.0 Median :241.206
## Mean :15.66 Mean :213.8 Mean :236.743
## 3rd Qu.:23.00 3rd Qu.:233.0 3rd Qu.:357.413
## Max. :30.00 Max. :613.0 Max. :463.249
## Cost
## Min. : 3.764
## 1st Qu.:128.521
## Median :235.322
## Mean :241.537
## 3rd Qu.:348.419
## Max. :507.522
From the summary above we can see that the minimum number of visits is 27 and the maximum is 613. Similarly, revenue ranges from 2.269 to 463.249 with a mean of 236.743, and cost ranges from 3.764 to 507.522 with a mean of 241.537. The dataset covers 30 weeks of campaign data and consists of five variables: Week, Campaign, Visits, Revenue and Cost.
Problem Statement (a): Overview of the entire market’s development and the different campaigns.
# Inspect the dataset
str(df)
## 'data.frame': 91 obs. of 5 variables:
## $ Week : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Campaign: chr "Aldebaran" "Aldebaran" "Aldebaran" "Aldebaran" ...
## $ Visits : int 27 64 80 93 120 130 146 173 170 218 ...
## $ Revenue : num 2.27 10.82 7.13 11.09 14.28 ...
## $ Cost : num 3.76 15.32 10.75 16.91 21.45 ...
# Shows the dimensions
dim(df)
## [1] 91 5
# shows the variable names
names(df)
## [1] "Week" "Campaign" "Visits" "Revenue" "Cost"
# shows the top 6 obs
head(df)
# shows the last 6 obs
tail(df)
Comparing the output of head() and tail() suggests there is a duplicate in the 91st row for week 30. We will check this, but first let’s see whether there is any missing data.
# let's check for missing data; any() collapses the full 91 x 5 logical matrix
any(is.na(df))
## [1] FALSE
There is no missing data. Now let’s check for the duplicates.
# let's check for duplicates; which() returns the positions of duplicated rows
which(duplicated(df))
## [1] 91
The 91st row is a duplicate of the week-30 Cottington observation. We need to get rid of it.
df <- df[-91,] # remove the last row 91st
tail(df) # Let's check again
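Instead of hard-coding row 91, duplicates can also be dropped by value, which is more robust if the data change. A small base-R sketch using a toy data frame (not the real dataset):

```r
# Toy example: drop exact duplicate rows by value rather than by row index
demo <- data.frame(Week = c(30, 30, 30),
                   Campaign = c("Aldebaran", "Cottington", "Cottington"))

# duplicated() flags the second and later occurrences of identical rows
demo_clean <- demo[!duplicated(demo), ]
nrow(demo_clean)
# returns: 2
```

Applied to the report’s data, `df <- df[!duplicated(df), ]` would have the same effect as removing row 91 by hand.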
1. Visualizing the dataset to get some insight into the campaigns over the 30 weeks, using the plotly package. Plotly generates interactive charts: if we move the cursor over a chart, we can see the values.
There are also features like zooming in and out, box select, selecting and deselecting series, download, etc.
NOTE: To see the values, please move the cursor over the chart.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
1.1 Visualize the visits per week for each campaign
p <- plot_ly(df, x = ~Week, y=~Visits, color = ~Campaign)
p
1. Aldebaran Campaign
The plot shows that the Aldebaran campaign had only 27 visits in the first week and 64 visits in the second week, more than double. Later, this campaign showed a tremendous increase in visits, attracting more and more visitors and reaching 613 in the last (30th) week.
2. Bartledan Campaign
Bartledan had 128 visits during the first week, which is very high compared with Aldebaran. During the third week it came down to 99, then increased very slowly and reached only 247 visits in the last week.
3. Cottington campaign
The Cottington campaign had 149 visits in the first week, higher than both Aldebaran and Bartledan. In the last week its visits decreased to 125, the lowest amongst all three.
1.2 Visualize the cost spent on each campaign
p <- plot_ly(df, x = ~Campaign, y=~Cost, color = ~Campaign)
p
From the above chart we can see that the cost spent on the Cottington campaign is very high, followed by Bartledan and Aldebaran. The total cost spent on Aldebaran is very low compared with the other two campaigns.
1.3 Visualize the revenue generated by each campaign
p <- plot_ly(df, x = ~Campaign, y=~Revenue, color = ~Campaign)
p
From the above chart we can see that the Cottington campaign generated the most revenue, followed by Bartledan and Aldebaran.
1.4 Visualize the profit for each campaign, but first, we have to calculate the profit.
Profit = Revenue - Cost
library(dplyr) # for the %>% pipe and mutate()
prof <- df %>%
mutate(Profit = Revenue - Cost)
# Check which campaign got the maximum and minimum profit
p <- plot_ly(prof, x = ~Campaign, y=~Profit, color = ~Campaign)
p
The profit chart shows that the Cottington campaign got the highest profit amongst all.
Aldebaran turned a small profit in the end, but is almost break-even.
Bartledan is at a loss and the worst amongst all.
Problem Statement (b): Assess the development of the quality of traffic, e.g. in terms of revenue per visitor. How is the overall development, and how does each campaign evolve?
To answer this we have to create a few metrics and visualizations. We will calculate Revenue Per Visitor (RPV), Cost Per Visitor (CPV), Profit Per Visitor (PPV) and the overall campaign profit. This will give us a clearer picture of the campaigns.
1.5 Calculate Revenue Per Visitor
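The code chunk for this step is missing from the report. Assuming it mirrored the CPV and PPV chunks in sections 1.6 and 1.7, it likely looked like the sketch below (an assumption, not the author’s original chunk):

```r
# Calculate RPV using the formula RPV = Revenue / Visits
rpv <- prof %>%
  mutate(RPV = Revenue / Visits)

# Visualize weekly Revenue Per Visitor using a line chart
p <- plot_ly(rpv, x = ~Week, y = ~RPV, color = ~Campaign,
             type = 'scatter', mode = 'lines')
p
```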
As we can see, the Cottington campaign generated the maximum revenue of 2.85 per visitor, followed by Bartledan and Aldebaran at 1.65 and 0.42. When we visualized the visits per week, we saw that Cottington had 149 visits in the first week, higher than both Aldebaran and Bartledan, and that by the last week its visits had decreased to 125, the lowest among all three. Even with fewer visitors, the Cottington campaign achieved the highest RPV of the three campaigns.
Revenue per visitor alone doesn’t give us the full picture of a campaign: the cost spent on the campaign is also a very important factor. As we have already seen above, the cost spent on Cottington and Bartledan was very high.
1.6 Calculate the Cost Per Visitor.
Now we will calculate the cost per visitor in the same way we calculated revenue per visitor.
# Calculate CPV using the formula CPV = Cost / Visits
cpv <- prof %>%
mutate(CPV = Cost / Visits)
# Visualize weekly Cost Per Visitor using a line chart
p <- plot_ly(cpv, x = ~Week, y=~CPV, color = ~Campaign, type = 'scatter',
mode = 'lines')
p
As we can see in the plot above, the cost spent per visitor is very high for Cottington (3.11) and Bartledan (1.78), while for Aldebaran it is very low at 0.34. This metric tells us how much it costs to attract one visitor; it should be as low as possible to generate more profit per visitor.
1.7 Calculate Profit Per Visitor.
# Calculate PPV using the formula PPV = Profit / Visits
ppv <- prof %>%
mutate(PPV = Profit / Visits)
# Visualize weekly Profit Per Visitor using a line chart
p <- plot_ly(ppv, x = ~Week, y=~PPV, color = ~Campaign, type = 'scatter',
mode = 'lines')
p
Now we can see the development of traffic quality much more clearly, and the picture is quite different from Revenue Per Visitor. The plot above shows that Cottington and Bartledan actually generate no profit per visitor; instead, they are at a loss. The Aldebaran campaign became profitable after the 20th week and ends with the highest profit of 0.07 per visitor, far better than the other two campaigns. Bartledan is at a loss of 0.13 per visitor and Cottington at a loss of 0.26 per visitor.
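Note that PPV is, by construction, simply RPV minus CPV, since Profit/Visits = (Revenue - Cost)/Visits. A quick base-R consistency check with illustrative numbers (not values from the dataset):

```r
# Illustrative numbers only; the real values come from the campaign data
revenue <- 300
cost    <- 250
visits  <- 150

rpv <- revenue / visits            # revenue per visitor
cpv <- cost / visits               # cost per visitor
ppv <- (revenue - cost) / visits   # profit per visitor

# PPV should equal RPV - CPV exactly
isTRUE(all.equal(ppv, rpv - cpv))
# returns: TRUE
```

This identity explains why Cottington can have the highest RPV yet a negative PPV: its CPV (3.11) exceeds its RPV (2.85).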
1.8 Let’s calculate the Overall Campaign Profit
library(tidyr) # for pivot_longer()
over_profit <- df %>%
select(Week, Campaign, Revenue, Cost) %>%
pivot_longer(cols = Revenue:Cost, names_to = "CostRevenue", values_to = "Value") %>%
group_by(Campaign, CostRevenue) %>%
summarise(Value = round(sum(Value), 1))
over_profit
# Plot the overall profit
p <- plot_ly(over_profit, x = ~Campaign, y=~Value, color = ~CostRevenue)
p
The overall campaign profit shows that Cottington generated the highest profit, Bartledan is at a loss, and Aldebaran is almost break-even.
The overall summary of the campaigns:
Aldebaran:
The Aldebaran campaign started at a loss and had very low visits in the first week, but in the second week visits more than doubled.
The revenue generated per visitor increased 2.5-fold from week 1 to week 30.
After week 20 this campaign showed a tremendous increase in visits, attracting more and more visitors, and by the end it had the best quality of traffic.
Bartledan:
Bartledan had 128 visits during the first week. During the third week visits went down to 99, then increased very slowly, reaching only 247 visits in the last week.
The cost spent on this campaign was the second-highest, and according to Profit Per Visitor it started with the highest loss of all and stayed there until week 26. Even after 26 weeks, it couldn’t generate much profit.
The revenue generated per visitor increased roughly 1.6-fold from week 1 to week 30.
Its smallest loss, 7.28, came in the 4th week. This campaign was the worst of all.
Cottington.
The Cottington campaign had the highest number of first-week visits at 149. By the last week its visits had decreased to 125, the lowest amongst all three; even with fewer visitors, it achieved the highest RPV of the three campaigns.
Revenue per visitor rose steadily from week 1 to 29, with an unusual jump at week 30 driven by an unusual increase in the cost spent.
The revenue generated per visitor increased roughly 1.5-fold from week 1 to week 30.
Cottington was the only campaign generating a profit from week 1, but it gradually fell and ended up producing the largest loss per week.
After week 20 it could no longer generate profits.
1.9 Advice in which campaign to invest and why?
Based on our findings, I would advise the business developer to invest in the Aldebaran campaign. The cost spent on Aldebaran was very low, yet it still turned a profit towards the end, and it is the only campaign with a positive profit per visitor. Additional spend in Aldebaran is therefore the most likely to generate additional profit.
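As a rough, purely illustrative projection for problem (c), we can combine the Aldebaran per-visitor figures reported above (CPV ≈ 0.34 from section 1.6, PPV ≈ 0.07 from section 1.7) with the extra 250€ per week. This assumes the late-period per-visitor economics stay constant from week 31 onwards, which is optimistic:

```r
budget <- 250   # additional weekly spend in EUR
cpv    <- 0.34  # Aldebaran cost per visitor (from section 1.6)
ppv    <- 0.07  # Aldebaran profit per visitor (from section 1.7)

extra_visits <- budget / cpv        # visits the extra budget could buy
extra_profit <- extra_visits * ppv  # profit those visits could generate

round(extra_visits)    # roughly 735 additional visits per week
round(extra_profit, 1) # roughly 51.5 EUR additional profit per week
```

Under these assumptions, the extra 250€ per week would buy on the order of 700+ additional visits and around 50€ of additional weekly profit in the market from week 31 onwards.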
2.0 Session Data
# Avoid scientific notation in printed output
options(scipen = 999)
# Import the session dataset
df_sessions <- read.csv("session_data.csv",header=TRUE,stringsAsFactors = F, sep=",")
summary(df_sessions)
## session session_start_text session_end_text clickouts
## Min. :20170500000000 Length:10000 Length:10000 Min. :0.000
## 1st Qu.:20170500000000 Class :character Class :character 1st Qu.:2.000
## Median :20170500000000 Mode :character Mode :character Median :2.000
## Mean :20170500000000 Mean :2.485
## 3rd Qu.:20170500000000 3rd Qu.:3.000
## Max. :20170500000000 Max. :8.000
## booking
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.0967
## 3rd Qu.:0.0000
## Max. :1.0000
Data Exploration and Preparation
# Inspect the dataset
str(df_sessions)
## 'data.frame': 10000 obs. of 5 variables:
## $ session : num 20170500000000 20170500000000 20170500000000 20170500000000 20170500000000 ...
## $ session_start_text: chr "06:11:53" "21:06:41" "12:03:01" "05:58:00" ...
## $ session_end_text : chr "06:15:11" "21:08:23" "12:06:02" "06:02:56" ...
## $ clickouts : int 3 3 3 0 1 4 3 2 2 4 ...
## $ booking : int 0 0 0 0 0 0 0 1 0 0 ...
# The time columns are characters here; we will need to convert them
# before doing any arithmetic on them.
dim(df_sessions)
## [1] 10000 5
# shows the top 6 obs
head(df_sessions)
# shows the last 6 obs
tail(df_sessions)
2.1 Now we will calculate the time duration of a session in minutes (between session start and session end)
# Character times cannot be subtracted directly, so parse them with
# strptime() first, then take the difference in minutes
df_sessions <- df_sessions %>%
mutate(duration = round(as.numeric(
difftime(strptime(session_end_text, format = "%H:%M:%S"),
strptime(session_start_text, format = "%H:%M:%S"),
units = "mins"))))
head(df_sessions)
2.2 We can see that the duration column has been generated and shows the duration of each session in minutes.
Let’s count the number of bookings for 0 and 1
df_sessions %>%
group_by(booking) %>%
summarise(count = n())
2.3 Let’s count the number of clickouts
df_sessions %>%
group_by(clickouts) %>%
summarise(count = n())
As we can see, most sessions have 2 or 3 clickouts.
2.4 Let’s count the durations
df_sessions %>%
group_by(duration) %>%
summarise(count = n())
The durations above include negative values. This is not a bug in R: these sessions start before midnight and end after it, so the end time falls on the next day and the naive difference is off by 24 hours. Adding a full day (86400 seconds, i.e. 1440 minutes) recovers the true duration.
For example:
-1438 minutes * 60 = -86280 seconds
86400 - 86280 = 120 seconds (2 minutes)
df_sessions$duration[df_sessions$duration==-1438] = 2
df_sessions$duration[df_sessions$duration==-1437] = 3
df_sessions$duration[df_sessions$duration==-1436] = 4
df_sessions$duration[df_sessions$duration==-1435] = 5
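The hard-coded replacements above cover only the four negative values that happen to occur in this dataset. A more general fix adds 1440 minutes (one day) to any negative duration; a base-R sketch with a hypothetical helper name:

```r
# Negative durations come from sessions that cross midnight; adding one
# full day (1440 minutes) recovers the true session length.
fix_midnight <- function(duration_min) {
  ifelse(duration_min < 0, duration_min + 1440, duration_min)
}

fix_midnight(c(-1438, -1437, 3, 5))
# returns: 2 3 3 5
```

With this, `df_sessions$duration <- fix_midnight(df_sessions$duration)` would handle any midnight-crossing session, not just the values seen here.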
2.5 Now we can check whether the values are correct
df_sessions %>%
group_by(duration) %>%
summarise(count = n())
We can see that it is showing the correct values now.
2.6 Visualize the session dataset for the distribution of booking with clickouts and duration using ggplot.
df_sessions %>%
group_by(booking, clickouts) %>%
summarise(count = n()) %>%
ggplot() + geom_bar(aes(x = clickouts, y = count,
fill = factor(booking)), stat = "identity") +
guides(fill = guide_legend(title = "Booking")) +
facet_grid(. ~ booking) +
labs(x = "Clickouts", y = "Count",
title = "Booking w.r.t. Clickouts")

As we can see in the plot above, the distribution of clickouts is roughly bell-shaped, and most bookings happen in sessions with 2 or 3 clickouts.
2.7 Now let’s check for the time duration.
df_sessions %>%
group_by(booking, duration) %>%
summarise(count = n()) %>%
ggplot() + geom_bar(aes(x = duration, y = count,
fill = factor(booking)), stat = "identity") +
guides(fill = guide_legend(title = "Booking")) +
facet_grid(. ~ booking) +
labs(x = "Session Minutes", y = "Count",
title = "Booking w.r.t. Duration")

This plot likewise shows a roughly bell-shaped distribution of session durations; most bookings happen in sessions of about 3 minutes.
It is tempting to conclude that booking is related to both duration and clickouts. We can run a hypothesis test to check whether these relations actually hold.
2.8 But first, let’s check the correlation between these variables using a correlation matrix, to see what it says before running the hypothesis test.
# Generate the correlation matrix (named cor_mat so it does not shadow cor())
cor_mat <- cor(df_sessions %>% select(booking, clickouts, duration))
cor_mat
##               booking   clickouts   duration
## booking    1.00000000 -0.04981168 0.01131174
## clickouts -0.04981168  1.00000000 0.04017565
## duration   0.01131174  0.04017565 1.00000000
plot(cor_mat)

The correlations above are all close to zero, suggesting essentially no linear relation between the variables. That is interesting: the distribution plots seemed to show a relation, yet the correlation matrix shows almost none. A hypothesis test will help verify whether any relation exists.
2.9 Hypothesis Test
In our case the null hypothesis (H0) and the alternative hypothesis (H1) are:
H0: There is no relation between booking and duration/clickouts.
H1: There is a relation between booking and duration/clickouts.
model = glm(booking ~ duration + clickouts, data = df_sessions
, family = binomial(link = "logit"))
summary(model)
##
## Call:
## glm(formula = booking ~ duration + clickouts, family = binomial(link = "logit"),
## data = df_sessions)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5741 -0.4651 -0.4395 -0.4137 2.3807
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.97323 0.12382 -15.936 < 0.0000000000000002 ***
## duration 0.04230 0.03138 1.348 0.178
## clickouts -0.16154 0.03214 -5.027 0.000000499 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6355.4 on 9999 degrees of freedom
## Residual deviance: 6328.7 on 9997 degrees of freedom
## AIC: 6334.7
##
## Number of Fisher Scoring iterations: 5
The p-value for duration is 0.178, which is greater than the significance level alpha = 0.05, so we fail to reject the null hypothesis: the data provide no evidence of a relation between booking and duration.
The p-value for clickouts is 0.000000499 (4.99 * 10^-7), which is far smaller than alpha = 0.05. Here we reject the null hypothesis: there is sufficient evidence of a relation between booking and clickouts. Note that the coefficient is negative (-0.16), so more clickouts are associated with a lower probability of booking.
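To interpret the significant clickouts coefficient on a more intuitive scale, we can exponentiate it into an odds ratio, using the estimate -0.16154 from the model summary above:

```r
# Coefficient for clickouts, taken from the glm summary above
beta_clickouts <- -0.16154

# In logistic regression, exp(beta) is the multiplicative change in the
# odds of booking for each additional clickout
odds_ratio <- exp(beta_clickouts)
round(odds_ratio, 3)
# returns: 0.851
```

In other words, each extra clickout multiplies the odds of booking by roughly 0.85, a modest but statistically significant negative association.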